Acknowledgement

I want to thank my mentors for their valuable inputs to this project:

Dr. Blanca Himes, Associate Professor, Biostatistics, Epidemiology and Informatics; for helping me come up with the objectives of my project and timely feedback,

Dr. Jesse Hsu, Assistant Professor, Biostatistics, Epidemiology and Informatics; for guiding me through the descriptive statistics and data presentation,

Sherrie Xie, VMD-PhD Candidate, Epidemiology; for helping me with the GIS maps.

Overview

My project aims to summarize and visualize the trends in the use of animals in biomedical research by species, pain levels and state in the US the over the period from 2008 to 2019. The data is obtained from the US Department of Agriculture’s APHIS annual reports. Following this exploratory analysis, a prediction for the number of animals that will be used over the next 5 years will be made.

Introduction

The use of animals in biomedical research is ubiquitous. There are two main categories for animal use: basic biomedical sciences research (mimicking a human disease in an animal model and studying gene expression, molecular mechanisms, etc.) and drug testing (toxicity and efficacy of an experimental drug). A growing body of studies and voices from the scientific community have pointed out the poor reliability and predictive value of animal models for human outcomes and for understanding human physiology (Akhtar 2015). In 2004, the FDA estimated that 92% of drugs that pass preclinical tests in animal models fail in human clinical trials (Harding 2004). More recent analysis suggests that, in spite of efforts to improve the predictability of animal testing, the failure rate has increased to 96% (Pippin 2012).

In 1938, the Congress passed the U.S. Federal Food, Drug, and Cosmetic Act, mandating animal toxicity testing. As of October 7, 2021, the Congress has introduced the bipartisan FDA Modernization Act to end animal testing mandates. This comes after the European Parliament resoundingly passed a resolution on September 16, 2021, with a vote of 667 to 4 to phase out animal testing. How has animal use in research changed over the past few years in the US? What is the significance of species, state, pain levels? Using the limited USDA data, I aim to answer these questions. Furthermore, a regression model will predict the number of animals that will be needed in the next few years if the same trends continue. This analysis could be of interest to biomedical companies to reduce their time and resources spent on animal models, FDA for revision of regulatory requirements, bioethics specialists and animal advocacy groups.

The trends in animal use could be influenced by technologies that enhanced the ease of building animal models, technologies that performed better than animal models, change in bioethical standards, change in public sentiment about the topic and even opinions voiced by the heads of government regulatory bodies. I am personally interested in this analysis because one of my career goals is to work towards developing and commercializing alternatives to animal methods in biomedical research.

Loading required packages

options(scipen = 999)
knitr::opts_chunk$set(echo = TRUE)
library("tidyverse")
library("dplyr")
library("ggplot2")
library("gridExtra")
library("sf")
library("spData")
library("mapview")
library("leaflet")

Cleaning and Loading Data

The US Department of Agriculture publishes an annual report of the number of animals used by state, species and pain category. The report for every year is published in January two years later. Hence the latest data I could obtain was from 2019. The reports were published starting 2008, hence I have data for a 12 year period.

The data is in the form of several PDFs. I cleaned the data manually into .txt format. The files are available in my github repo for reproduction of results. The nomenclature is year_col_X where X is the pain category. The pain categories are:

  • Column B: Animals held by a facility but not used in any research that year

  • Column C: Animals used in research; no pain involved; no pain drugs administered

  • Column D: Animals used in research; pain involved; pain drugs administered

  • Column E: Animals used in research; pain involved; no pain drugs administered.

The following code chunk loads the data as a dataframe, stacked by year and column. The loaded data has 53 rows per year per column. This includes the 50 states, District of Columbia, Puerto Rico, and “REPORT TOTAL” which is the sum of all columns (species). Note that the species column is not an exhaustive list of all animals used in biomedical research but only those covered by the Animal Welfare Act. The AWA does not cover rats, mice, birds and reptiles, which happen to be over 95% of all animals used. Columns B, C, D and E are assigned values 0, 1, 2 and 3 to reflect pain levels in a more intuitive way.

#Loading the data using nested functions
datafr <- purrr::map_dfr(
  .x = c(2008:2019), 
  .f = function(x){
    purrr::map_dfr(
  .x = c("B", "C", "D", "E"),
  .f = function(x, y){
    dat <- read.table(file = paste0("C:/Users/Deeksha Hegde/Downloads/BMIN503_Final_Project/USDA_Data/",y,"_col_",x,".txt"), header = TRUE, sep = "\t")
    dat %>%
      mutate(across(.cols = All_Other_Covered_Species:Total, .fns =~ as.numeric(str_remove_all(string = .x, ","))), 
        Year = y,
        Column = x
            )
    }, y=x
    )
  }
)

#Checking if the data loaded is correct
datafr %>% count(Year, Column)
##    Year Column  n
## 1  2008      B 53
## 2  2008      C 53
## 3  2008      D 53
## 4  2008      E 53
## 5  2009      B 53
## 6  2009      C 53
## 7  2009      D 53
## 8  2009      E 53
## 9  2010      B 53
## 10 2010      C 53
## 11 2010      D 53
## 12 2010      E 53
## 13 2011      B 53
## 14 2011      C 53
## 15 2011      D 53
## 16 2011      E 53
## 17 2012      B 53
## 18 2012      C 53
## 19 2012      D 53
## 20 2012      E 53
## 21 2013      B 53
## 22 2013      C 53
## 23 2013      D 53
## 24 2013      E 53
## 25 2014      B 53
## 26 2014      C 53
## 27 2014      D 53
## 28 2014      E 53
## 29 2015      B 53
## 30 2015      C 53
## 31 2015      D 53
## 32 2015      E 53
## 33 2016      B 53
## 34 2016      C 53
## 35 2016      D 53
## 36 2016      E 53
## 37 2017      B 53
## 38 2017      C 53
## 39 2017      D 53
## 40 2017      E 53
## 41 2018      B 53
## 42 2018      C 53
## 43 2018      D 53
## 44 2018      E 53
## 45 2019      B 53
## 46 2019      C 53
## 47 2019      D 53
## 48 2019      E 53
datafr <- mutate(datafr, pain.level = factor(Column, levels = c("B", "C", "D", "E"), labels = c(0, 1, 2, 3)))
head(datafr)
##   State All_Other_Covered_Species Cats Dogs Guinea_Pigs Hamsters Marine.Mammals
## 1    AK                       426    0    6           0      190              0
## 2    AL                       132   34  110           0        0              0
## 3    AR                        92   69   22          15        0              0
## 4    AZ                       144    8   26           0        0              0
## 5    CA                      1870  409  118         735       60              0
## 6    CO                       358  253    4         423      103              0
##   Nonhuman_Primates Other_Farm_Animals Pig Rabbits Sheep Total Year Column
## 1                 0                  0   0       0     0   622 2008      B
## 2                 0                612   4     321     6  1219 2008      B
## 3                34                  0   0       0     0   232 2008      B
## 4                38                  1  12      32     3   264 2008      B
## 5              6009               4473  45    4438   179 18336 2008      B
## 6                 0                  0   0       0     0  1141 2008      B
##   pain.level
## 1          0
## 2          0
## 3          0
## 4          0
## 5          0
## 6          0

Methods

This project is divided into two main parts: descriptive statistics and prediction model. In the first part, I performed an exploratory analysis of the data, visualizing by year, state, and species. In the second part, a model to predict the numbers for the next 5 years was created.

Data Visualization and Exploratory Analysis

Now let us visualize the numbers!

The numbers by year

The first plot shows the total animals (all categories combined) by year. The second plot shows the same but with the breakup by column.

data.total <- filter(datafr, State == "REPORT TOTAL")

#Total by year
data.total1 <- summarise(group_by(data.total, Year), Total = sum(Total))

total.plot1 <- ggplot(data = data.total1, aes(x = Year, y = Total)) + geom_line(size = 1) + scale_x_continuous(breaks = c(2008:2019)) + scale_y_continuous(breaks = seq(900000, 1300000, by = 100000), labels = seq(9, 13, 1)) + xlab(label = "Year") + ylab(label = "Total in hundred thousands") + ggtitle("Total number of animals vs. Year") +
theme(
        plot.title = element_text(size = 10, face = "bold", hjust = 0.5),
        axis.title = element_text(size = 10),
        axis.text = element_text(size = 10),
        legend.position = "bottom",
        legend.direction = "horizontal",
        legend.title = element_text(size = 10),
        legend.text = element_text(size = 10)
      )

#Total by year and column
total.plot2 <- ggplot(data = data.total, aes(x = Year, y = Total, color = pain.level)) + geom_line(size = 1) + scale_x_continuous(breaks = c(2008:2019)) + scale_y_continuous(breaks = seq(100000, 700000, by = 100000), labels = seq(1, 7, 1)) + xlab(label = "Year") + ylab(label = "Total in hundred thousands") + ggtitle("Total number of animals vs. Year by Pain Levels") +
theme(
        plot.title = element_text(size = 10, face = "bold", hjust = 0.5),
        axis.title = element_text(size = 10),
        axis.text = element_text(size = 10),
        legend.position = "bottom",
        legend.direction = "horizontal",
        legend.title = element_text(size = 10),
        legend.text = element_text(size = 10)
      )
grid.arrange(total.plot1, total.plot2, ncol = 2)

We see that there is a huge rise in the total from 2009 to 2010, and a steady decline since then. Overall, there is a significant decline over the 12 year period.

From the second plot, we see that pain category 1, which involves no pain and no pain drugs, is the highest in magnitude throughout the study period. We also see that the decline in the total animals is mainly due to the decline in animals at pain level 1.

The numbers by state

Now we will examine the total numbers by state.

#Total by column by state
data.by.state <- datafr %>%
  filter(State != "REPORT TOTAL") %>%
  group_by(State, pain.level) %>%
  summarise(Total = sum(Total))

data.by.state.total <- group_by(data.by.state, State) %>%
  summarize(Total.all = sum(Total))
state.plot.1 <- ggplot(data.by.state.total, aes(x = State, y = Total.all)) + geom_bar(stat = "identity") + ggtitle("Total number of animals used by state") + ylab(label = "Total across all pain levels in hundred thousands") + scale_y_continuous(breaks = seq(100000, 1300000, by = 100000), labels = seq(1, 13, 1)) +
theme(
        plot.title = element_text(size = 14, face = "bold", hjust = 0.5),
        axis.title = element_text(size = 10),
        axis.text.x = element_text(size = 12, angle = 90, vjust = 0.5, hjust = 0.5),
        axis.text.y = element_text(size = 10),
        legend.position = "bottom",
        legend.direction = "horizontal",
        legend.title = element_text(size = 10),
        legend.text = element_text(size = 10)
      )

state.plot.2 <- ggplot(data.by.state, aes(x = State, y = Total, color = pain.level)) + geom_point(size = 4) + ggtitle("Total number of animals used by state by pain level") + scale_y_continuous(breaks = seq(100000, 700000, by = 100000), labels = seq(1, 7, 1)) + ylab(label = "Total in hundred thousands") +
theme(
        plot.title = element_text(size = 14, face = "bold", hjust = 0.5),
        axis.title = element_text(size = 10),
        axis.text.x = element_text(size = 12, angle = 90, vjust = 0.5, hjust = 0.5),
        axis.text.y = element_text(size = 10),
        legend.position = "bottom",
        legend.direction = "horizontal",
        legend.title = element_text(size = 10),
        legend.text = element_text(size = 10)
      )

state.plot.1

state.plot.2

From the graphs, we get the following information:

States using the highest number of animals overall across all pain levels: CA, MA, NJ

States using the highest number of animals by pain level:

  • 0 CA, MD, TX

  • 1 CA, MA, OH

  • 2 CA, TX, MA

  • 3 MO, MI, IA

An arguably better representation of the same data is the use of interactive maps.

usa <- st_read("C:/Users/Deeksha Hegde/Downloads/us-state-boundaries/us-state-boundaries.shp") %>%
  rename(State = stusab)
## Reading layer `us-state-boundaries' from data source 
##   `C:\Users\Deeksha Hegde\Downloads\us-state-boundaries\us-state-boundaries.shp' 
##   using driver `ESRI Shapefile'
## Simple feature collection with 56 features and 20 fields
## Geometry type: MULTIPOLYGON
## Dimension:     XY
## Bounding box:  xmin: -179.2311 ymin: -14.60181 xmax: 179.8597 ymax: 71.44069
## Geodetic CRS:  WGS 84
head(usa)
## Simple feature collection with 6 features and 20 fields
## Geometry type: MULTIPOLYGON
## Dimension:     XY
## Bounding box:  xmin: -124.8485 ymin: 14.03656 xmax: 146.1544 ymax: 49.00241
## Geodetic CRS:  WGS 84
##   gid    arealand division intptlat
## 1  16   278176477        0 18.21765
## 2  23   472276664        0 14.93678
## 3  31  1627312771        7 34.89553
## 4  35  2136109036        5 38.64729
## 5  40 -1616974352        1 41.59742
## 6  54   312831514        9 47.40732
##                                           name objectid   areawater   intptlon
## 1                                  Puerto Rico       50   628200285  -66.41080
## 2 Commonwealth of the Northern Mariana Islands       36   349301029  145.60102
## 3                                     Arkansas       44 -1334552525  -92.44463
## 4                                West Virginia        1   489848791  -80.61833
## 5                                 Rhode Island        6  1323457457  -71.52727
## 6                                   Washington       20  -324557627 -120.57580
##           oid funcstat    centlon State state  statens  centlat
## 1   303146031        A  -66.41476    PR    72 01779808 18.21647
## 2 -1625647860        A  145.59687    MP    69 01779809 16.79744
## 3   266078934        A  -92.44270    AR    05 00068085 34.89402
## 4 -1929409300        A  -80.62346    WV    54 01779805 38.64119
## 5 -1861167639        A  -71.52472    RI    44 01219835 41.59403
## 6 -1859906639        A -120.59557    WA    53 01779804 47.41490
##                                       basename mtfcc region lsadc geoid
## 1                                  Puerto Rico G4000      9    00    72
## 2 Commonwealth of the Northern Mariana Islands G4000      9    00    69
## 3                                     Arkansas G4000      3    00    05
## 4                                West Virginia G4000      3    00    54
## 5                                 Rhode Island G4000      1    00    44
## 6                                   Washington G4000      4    00    53
##                         geometry
## 1 MULTIPOLYGON (((-67.20794 1...
## 2 MULTIPOLYGON (((145.5726 15...
## 3 MULTIPOLYGON (((-94.55218 3...
## 4 MULTIPOLYGON (((-81.74725 3...
## 5 MULTIPOLYGON (((-71.7897 41...
## 6 MULTIPOLYGON (((-123.2479 4...
to_map <- inner_join(usa, data.by.state.total, by = "State")

pal_fun <- colorNumeric("Blues", NULL) #Pick a different pal, not starting from white
pu_message <- paste0(to_map$State,  
                     "<br>Number of animals used: ",       
                     prettyNum(to_map$Total.all, big.mark = ","))

leaflet(to_map) %>%
  addPolygons(stroke = FALSE,
              fillColor = ~pal_fun(Total.all),
              fillOpacity = 0.8, smoothFactor = 0.5, 
              popup = pu_message) %>%
  addProviderTiles(providers$CartoDB.Positron) %>%
  addLegend("bottomright",                           
            pal=pal_fun,                             
            values=~Total.all,                 
            title = 'Total number of animals used',                  
            opacity = 1) %>%                         
  addScaleBar()

Now, since we’re in Pennsylvania, let us see what the plots for PA look like.

#Total for PA
data.PA <- filter(datafr, State == "PA")
PA.plot.1 <- ggplot(data = summarise(group_by(data.PA, Year), Total = sum(Total)), aes(x = Year, y = Total)) + geom_line(size = 1) + scale_x_continuous(breaks = c(2008:2019)) + ggtitle("Total number of animals vs. Year in Pennsylvania") + theme(
        plot.title = element_text(size = 10, face = "bold", hjust = 0.5),
        axis.title = element_text(size = 10),
        axis.text = element_text(size = 10),
        legend.position = "bottom",
        legend.direction = "horizontal",
        legend.title = element_text(size = 10),
        legend.text = element_text(size = 10)
      )

#Total by column for PA
PA.plot.2 <- ggplot(data = data.PA, aes(x = Year, y = Total, color = pain.level)) + geom_line(size = 1) +
  scale_x_continuous(breaks = c(2008:2019)) + ggtitle("Total number of animals vs. Year by Pain Levels in Pennsylvania") + theme(
        plot.title = element_text(size = 10, face = "bold", hjust = 0.5),
        axis.title = element_text(size = 10),
        axis.text = element_text(size = 10),
        legend.position = "bottom",
        legend.direction = "horizontal",
        legend.title = element_text(size = 10),
        legend.text = element_text(size = 10)
      )

grid.arrange(PA.plot.1, PA.plot.2, ncol = 2)

We see a huge decline from 2009 to 2015.

Next, I want to find out which states contributed the most to the decline observed in the first plot. I calculated the mean and standard deviation over the years for each state and sorted them in decreasing order of SD. Noticing that Arizona seems to have a higher SD than mean, which raised suspicion, I calculated the coefficient of variability (CV) and sorted in the decreasing order of CV.

#Std dev of total for all states
data.by.state.year <- datafr %>%
  filter(State != "REPORT TOTAL") %>%
  group_by(State, Year) %>%
  summarise(Total = sum(Total))
  
state.std.dev <- summarise(data.by.state.year, mean = mean(Total), sd = sd(Total))
state.std.dev[order(state.std.dev$sd, decreasing = TRUE),]
## # A tibble: 52 x 3
##    State    mean     sd
##    <chr>   <dbl>  <dbl>
##  1 AZ     16261. 35984.
##  2 NY     45696. 25676.
##  3 CA    103133. 22675.
##  4 KS     11863. 20574.
##  5 PA     50591. 20247.
##  6 OH     64393. 18341.
##  7 IA     30172  15683.
##  8 NJ     67210. 13946.
##  9 MO     36277. 13905.
## 10 MD     58124. 13290.
## # ... with 42 more rows
state.std.dev <- state.std.dev %>% mutate(cv = sd/mean) %>% 
                  arrange(desc(cv))
state.std.dev
## # A tibble: 52 x 4
##    State   mean     sd    cv
##    <chr>  <dbl>  <dbl> <dbl>
##  1 AZ    16261. 35984. 2.21 
##  2 KS    11863. 20574. 1.73 
##  3 WV     2116.  2936. 1.39 
##  4 NE     4406.  3886. 0.882
##  5 AK      636.   468. 0.735
##  6 HI      337    227. 0.675
##  7 DC     7995.  5279. 0.660
##  8 WY      543.   307. 0.565
##  9 NY    45696. 25676. 0.562
## 10 IN    17136.  9132. 0.533
## # ... with 42 more rows

The states having CV > 1 will be examined further.

#Total by column for AZ

data.AZ <- filter(datafr, State == "AZ")
ggplot(data = data.AZ, aes(x = Year, y = Total, color = pain.level)) + geom_line(size = 1)+
  scale_x_continuous(breaks = c(2008:2019)) + ggtitle("Total number of animals vs. Year by Pain Levels in Arizona")

Column C (pain level 1) of the year 2010 is clearly an outlier. I verified the data again from the USDA report to check if there was an error in data loading. It is most likely a typographical error in the USDA report. On examination, I found that the outlier is coming from the All_Other_Covered_Species column. I first replaced this value with NA and found the mean along the column for the rest of the years. I replaced the NA with the rounded mean and finally updated the total. The updated plot for AZ is also shown.

#Setting the outlier point to NA
data.AZ[, 2][data.AZ['Column'] == "C" & data.AZ['Year'] == 2010] <- NA

#Setting the outlier point to the mean of the column over the years excluding outlier year
data.AZ[, 2][data.AZ['Column'] == "C" & data.AZ['Year'] == 2010] <- round(mean(data.AZ[, 2][data.AZ['Column'] == "C"], na.rm = TRUE), digits = 0)

#Updating the Total column for the year
data.AZ[, 'Total'][data.AZ['Column'] == "C" & data.AZ['Year'] == 2010] <- sum(data.AZ[2:12][data.AZ['Column'] == "C" & data.AZ['Year'] == 2010])

#Updated plot with outlier adjusted
ggplot(data = data.AZ, aes(x = Year, y = Total, color = pain.level)) + geom_line(size = 1)+
  scale_x_continuous(breaks = c(2008:2019)) + ggtitle("Total number of animals vs. Year by Pain Levels in Arizona (Outlier adjusted)")

The second state having CV > 1 is Kansas. Let’s look at its plot.

#Total by column for KS

data.KS <- filter(datafr, State == "KS")
ggplot(data = data.KS, aes(x = Year, y = Total, color = pain.level)) + geom_line(size = 1)+
  scale_x_continuous(breaks = c(2008:2019)) + ggtitle("Total number of animals vs. Year by Pain Levels in Kansas")

2019 pain level 1 is likely to be an outlier but unlike the case with Arizona, we cannot be sure since we don’t have the data for the years after 2019. I will not alter this point but exclude it in some analysis.

Finally, we have West Virginia.

#Total by column for WV

data.WV <- filter(datafr, State == "WV")
ggplot(data = data.WV, aes(x = Year, y = Total, color = pain.level)) + geom_line(size = 1)+
  scale_x_continuous(breaks = c(2008:2019)) + ggtitle("Total number of animals vs. Year by Pain Levels in West Virginia")

WV does not seem to have outliers, since there is an increase and decrease in all columns over 4 years from 2015-2019.

Now that we have investigated outliers, let us go back to the Total number of animals vs. Year plots and revise them.

#Updating dataframe and total plots

datafr[datafr['State'] == "AZ" & datafr['Column'] == "C" & datafr['Year'] == 2010, ] <- data.AZ[data.AZ['Column'] == "C" & data.AZ['Year'] == 2010, ]
datafr[530, 2:13] <- colSums(datafr[478:529, 2:13])

data.total <- filter(datafr, State == "REPORT TOTAL")

#Total by year
data.total1 <- summarise(group_by(data.total, Year), Total = sum(Total))
total.plot.1.updated <- ggplot(data = data.total1, aes(x = Year, y = Total)) + geom_line(size = 1) + scale_x_continuous(breaks = c(2008:2019)) + scale_y_continuous(breaks = seq(900000, 1300000, by = 100000), labels = seq(9, 13, 1)) + xlab(label = "Year") + ylab(label = "Total in hundred thousands") + ggtitle("Total number of animals vs. Year") +
theme(
        plot.title = element_text(size = 10, face = "bold", hjust = 0.5),
        axis.title = element_text(size = 10),
        axis.text = element_text(size = 10),
        legend.position = "bottom",
        legend.direction = "horizontal",
        legend.title = element_text(size = 10),
        legend.text = element_text(size = 10)
      )

#Total by year and column
total.plot.2.updated <- ggplot(data = data.total, aes(x = Year, y = Total, color = pain.level)) + geom_line(size = 1) + scale_x_continuous(breaks = c(2008:2019)) + scale_y_continuous(breaks = seq(100000, 700000, by = 100000), labels = seq(1, 7, 1)) + xlab(label = "Year") + ylab(label = "Total in hundred thousands") + ggtitle("Total number of animals vs. Year by Pain Levels") +
theme(
        plot.title = element_text(size = 10, face = "bold", hjust = 0.5),
        axis.title = element_text(size = 10),
        axis.text = element_text(size = 10),
        legend.position = "bottom",
        legend.direction = "horizontal",
        legend.title = element_text(size = 10),
        legend.text = element_text(size = 10)
      )
grid.arrange(total.plot1, total.plot2, total.plot.1.updated, total.plot.2.updated, ncol = 2, left = "Before and after outlier adjustment")

Now we observe a more steady decline in the total number of animals over the years.

Next, let us look at a heat map of the percentage change by state over the 12 years. Kansas was excluded from this (set to NA) because it is a potential outlier that interefers with the data presentation done here.

data.state.percentage <- data.by.state.year[data.by.state.year['Year'] == 2008 | data.by.state.year['Year'] == 2019, ] %>%
group_by(State) %>%
summarize(percent = 100*(Total - first(Total))/first(Total)) %>%
filter(percent != 0)
data.state.percentage[data.state.percentage$State == "KS", "percent"] <- NA
data.state.percentage
## # A tibble: 52 x 2
## # Groups:   State [52]
##    State percent
##    <chr>   <dbl>
##  1 AK      -76.1
##  2 AL      -18.3
##  3 AR       36.5
##  4 AZ      -20.4
##  5 CA      -40.6
##  6 CO      -16.0
##  7 CT      -58.6
##  8 DC      -28.1
##  9 DE      -60.5
## 10 FL       11.4
## # ... with 42 more rows
to_map1 <- inner_join(usa, data.state.percentage, by = "State")

#Gradient for -ve and +ve
#KS taken as NA because cannot conclude as outlier but very high value

minval <- min(data.state.percentage$percent, na.rm = TRUE)
maxval <- max(data.state.percentage$percent, na.rm = TRUE)
domain <- c(minval, maxval)
colorPal <- c(colorRampPalette(colors = c("red", "yellow"), space = "Lab")(abs(minval)), colorRampPalette(colors = c("yellow", "green"), space = "Lab")(maxval))
##b2182b, #2166ac
pal_fun <- colorNumeric("BuPu", NULL)
pu_message <- paste0(to_map1$State,  
                     "<br>Percentage change in total animals used from 2008 to 2019 (%): ",       
                     round(to_map1$percent, digits = 0))

leaflet(to_map1) %>%
  addPolygons(stroke = FALSE,
              color = 'white',
              fillOpacity = 0.8, smoothFactor = 0.5,
              fillColor = ~get('colorBin')(colorPal, domain)(percent),
              popup = pu_message) %>%
  addProviderTiles(providers$CartoDB.Positron) %>%
  addLegend("bottomright",
            pal=colorBin(colorPal, domain = domain + 1),                             
            values=~domain,                 
            title = 'Percentage change in total animals used from 2008 to 2019 by state (%)',
            opacity = 1) %>%
  addScaleBar()

Next, I want to visualize the patterns in each of the states. The following chunk plots the data for each of the states over the 12 year period.

ggplot(data = datafr %>% filter(State != "REPORT TOTAL"), aes(x = Year, y = Total, color = pain.level)) + geom_line(size = 2.5) + ggtitle("Total number of animals used by each state by pain levels") +
  scale_x_continuous(breaks = c(2008:2019)) + facet_wrap( ~ State, ncol = 5, scales = "free") + theme(
        strip.text.x = element_text(size = 40, color = "blue", face = "bold"),
        plot.title = element_text(size = 50, face = "bold", hjust = 0.5),
        axis.title = element_blank(),
        axis.text = element_text(size = 8),
        legend.position = "bottom",
        legend.justification = "right",
        legend.title = element_text(size = 50),
        legend.text = element_text(size = 50),
        legend.key.width = unit(x = 3, units = "in")
      )

From this grid, we see that out of the top 10 states using the highest number of animals, 6 of them saw a decreasing trend in the study period.

The numbers by species

To visualize the numbers by species, I first performed the same standard deviation analysis as before for species. The following chunks plot the mean and standard deviation over the years by species for the top 3 states, California, Massachusetts and New Jersey.

#Std dev of species for top states
data.CA <- filter(datafr, State == "CA") %>%
  filter(State != "REPORT TOTAL") %>%
  select(-State) %>%
  group_by(Year) %>%
  summarise(across(.cols = All_Other_Covered_Species:Total, .fns = sum, na.rm = TRUE)) %>%
  ungroup() %>%

  summarise(across(.cols = All_Other_Covered_Species:Total, .fns = sd, na.rm = TRUE)) %>%
  pivot_longer(cols = All_Other_Covered_Species:Total, names_to = "species", values_to = "sd") %>%
  arrange(desc(sd)) %>% full_join(

filter(datafr, State == "CA") %>%
  filter(State != "REPORT TOTAL") %>%
  select(-State) %>%
  group_by(Year) %>%
  summarise(across(.cols = All_Other_Covered_Species:Total, .fns = sum, na.rm = TRUE)) %>%
  ungroup() %>%
  summarise(across(.cols = All_Other_Covered_Species:Total, .fns = ~mean(x = .x, na.rm = TRUE))) %>%
  pivot_longer(cols = All_Other_Covered_Species:Total, names_to = "species", values_to = "mean"), by = "species") %>% 
  arrange(sd)

CA.species <- ggplot(data.CA %>% mutate(species = factor(species, levels = data.CA$species)), aes(x = species, y = sd)) + geom_bar(stat="identity", fill="steelblue", width=0.5) + geom_bar(mapping = aes(y = mean), stat="identity", fill="steelblue", width=0.5, alpha = 0.5, position = "dodge") + ggtitle("Mean and Standard deviation of species over 12 years in CA") + ylab("Mean (light blue) and SD (dark blue)") + xlab("Species") +
theme(
        plot.title = element_text(size = 10, face = "bold", hjust = 0.5),
        axis.title = element_text(size = 10),
        axis.text.x = element_text(size = 8, angle = 90, vjust = 0.5, hjust = 0.5),
        legend.position = "bottom",
        legend.direction = "horizontal",
        legend.title = element_text(size = 10),
        legend.text = element_text(size = 10)
      )

#Std dev of species for top states - MA
data.MA <- filter(datafr, State == "MA") %>%
  filter(State != "REPORT TOTAL") %>%
  select(-State) %>%
  group_by(Year) %>%
  summarise(across(.cols = All_Other_Covered_Species:Total, .fns = sum, na.rm = TRUE)) %>%
  ungroup() %>%

  summarise(across(.cols = All_Other_Covered_Species:Total, .fns = sd, na.rm = TRUE)) %>%
  pivot_longer(cols = All_Other_Covered_Species:Total, names_to = "species", values_to = "sd") %>%
  arrange(desc(sd)) %>% full_join(

filter(datafr, State == "MA") %>%
  filter(State != "REPORT TOTAL") %>%
  select(-State) %>%
  group_by(Year) %>%
  summarise(across(.cols = All_Other_Covered_Species:Total, .fns = sum, na.rm = TRUE)) %>%
  ungroup() %>%
  summarise(across(.cols = All_Other_Covered_Species:Total, .fns = ~mean(x = .x, na.rm = TRUE))) %>%
  pivot_longer(cols = All_Other_Covered_Species:Total, names_to = "species", values_to = "mean"), by = "species") %>% 
  arrange(sd)

MA.species <- ggplot(data.MA %>% mutate(species = factor(species, levels = data.MA$species)), aes(x = species, y = sd)) + geom_bar(stat="identity", fill="steelblue", width=0.5) + geom_bar(mapping = aes(y = mean), stat="identity", fill="steelblue", width=0.5, alpha = 0.5, position = "dodge") + ggtitle("Mean and Standard deviation of species over 12 years in MA") + ylab("Mean (light blue) and SD (dark blue)") + xlab("Species") +
theme(
        plot.title = element_text(size = 10, face = "bold", hjust = 0.5),
        axis.title = element_text(size = 10),
        axis.text.x = element_text(size = 8, angle = 90, vjust = 0.5, hjust = 0.5),
        legend.position = "bottom",
        legend.direction = "horizontal",
        legend.title = element_text(size = 10),
        legend.text = element_text(size = 10)
      )

#Std dev of species for top states - NJ
data.NJ <- filter(datafr, State == "NJ") %>%
  filter(State != "REPORT TOTAL") %>%
  select(-State) %>%
  group_by(Year) %>%
  summarise(across(.cols = All_Other_Covered_Species:Total, .fns = sum, na.rm = TRUE)) %>%
  ungroup() %>%

  summarise(across(.cols = All_Other_Covered_Species:Total, .fns = sd, na.rm = TRUE)) %>%
  pivot_longer(cols = All_Other_Covered_Species:Total, names_to = "species", values_to = "sd") %>%
  arrange(desc(sd)) %>% full_join(

filter(datafr, State == "NJ") %>%
  filter(State != "REPORT TOTAL") %>%
  select(-State) %>%
  group_by(Year) %>%
  summarise(across(.cols = All_Other_Covered_Species:Total, .fns = sum, na.rm = TRUE)) %>%
  ungroup() %>%
  summarise(across(.cols = All_Other_Covered_Species:Total, .fns = ~mean(x = .x, na.rm = TRUE))) %>%
  pivot_longer(cols = All_Other_Covered_Species:Total, names_to = "species", values_to = "mean"), by = "species") %>% 
  arrange(sd)

NJ.species <- ggplot(data.NJ %>% mutate(species = factor(species, levels = data.NJ$species)), aes(x = species, y = sd)) + geom_bar(stat="identity", fill="steelblue", width=0.5) + geom_bar(mapping = aes(y = mean), stat="identity", fill="steelblue", width=0.5, alpha = 0.5, position = "dodge") + ggtitle("Mean and Standard deviation of species over 12 years in NJ") + ylab("Mean (light blue) and SD (dark blue)") + xlab("Species") +
theme(
        plot.title = element_text(size = 10, face = "bold", hjust = 0.5),
        axis.title = element_text(size = 10),
        axis.text.x = element_text(size = 8, angle = 90, vjust = 0.5, hjust = 0.5),
        legend.position = "bottom",
        legend.direction = "horizontal",
        legend.title = element_text(size = 10),
        legend.text = element_text(size = 10)
      )

grid.arrange(CA.species, MA.species, NJ.species, ncol = 1)

These plots show us the variation in each of the species by state. We see that in CA, rabbits are the most commonly used species and they have a high standard deviation. Relative to their numbers, farm animals and guinea pigs also showed high variation. In MA, hamsters showed the highest variation while guinea pigs were the highest used by number. In NJ, hamsters were the top species ad also showed high standard deviation. Relative to their numbers, non-human primates also showed high variation in NJ.

Next I want to visualize the patterns for each of the species over the 12 year period. The following chunk plots the same in one grid.

data.species.pl <- data.total %>%
  group_by(pain.level) %>%
  summarise(across(.cols = All_Other_Covered_Species:Total, .fns = sum, na.rm = TRUE))

data.species.year <- data.total %>%
  group_by(Year) %>%
  summarise(across(.cols = All_Other_Covered_Species:Total, .fns = sum, na.rm = TRUE)) %>%
  pivot_longer(cols = All_Other_Covered_Species:Total, names_to = "species", values_to = "number")

#Plot grid for each species
ggplot(data = data.species.year, aes(x = Year, y = number)) + geom_line(size = 1) + ggtitle("Number of animals by species between 2008-2019") +
  scale_x_continuous(breaks = c(2008:2019)) + facet_wrap( ~ species, ncol = 4, scales = "free") + theme(
        strip.text.x = element_text(size = 8, color = "blue", face = "bold"),
        plot.title = element_text(size = 15, face = "bold", hjust = 0.5),
        axis.title = element_blank(),
        axis.text = element_text(size = 7),
        legend.position = "bottom",
        legend.justification = "right",
        legend.title = element_text(size = 10),
        legend.text = element_text(size = 10),
        legend.key.width = unit(x = 1, units = "in")
      )

#% change for each species
data.species.percentage <- filter(data.species.year, (Year == 2008 | Year == 2019)) %>%
  group_by(species) %>%
  summarise(percent = 100*(number - first(number))/first(number)) %>%
  filter(percent != 0) %>%
  mutate(pos = percent >= 0) %>%
  arrange(desc(percent)) 
## `summarise()` has grouped output by 'species'. You can override using the `.groups` argument.
#Bar plot of % change for each species
ggplot(data.species.percentage %>% mutate(species = factor(species, levels = data.species.percentage$species)), aes(x = species, y = percent, fill = pos)) + geom_bar(width = 0.8, position = position_dodge(width = 0.8), stat = "identity") + geom_text(data = data.species.percentage %>% filter(pos == FALSE), aes(label = round(percent, digits = 0)), color = 'red', size = 6, hjust=1.2, vjust=0.5) + geom_text(data = data.species.percentage %>% filter(pos == TRUE), aes(label = round(percent, digits = 0)), color = 'cyan', size = 6, hjust=-0.2, vjust=0.5) + xlab(label = "Species") + ylab(label = "Percentage change 2008-2019") + ggtitle("Percentage change from 2008 to 2019 by species") + coord_flip() +
theme(
        plot.title = element_text(size = 15, face = "bold", hjust = 0.5),
        axis.title = element_blank(),
        axis.text = element_text(size = 12),
        legend.position = "none"
      )

From the above plots, we have a clear picture of what the trends were for each of the species as well as their percentage change over the 12 year period. We see that almost all categories saw a decrease. Among these, dogs, hamsters, non-human primates and rabbits saw a steep decline.

Predictive Analysis

I want to find what the numbers would be for the next 5 years if the same trend continues. I fit a linear model to the total number of animals vs. year graph and also made a table to reflect the figures.

m1 <- lm(Total ~ Year, data = data.total1)
summary(m1)
## 
## Call:
## lm(formula = Total ~ Year, data = data.total1)
## 
## Residuals:
##    Min     1Q Median     3Q    Max 
## -85230 -18560  -5143  25807  52073 
## 
## Coefficients:
##             Estimate Std. Error t value  Pr(>|t|)    
## (Intercept) 54799007    6887537   7.956 0.0000124 ***
## Year          -26705       3421  -7.807 0.0000146 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 40910 on 10 degrees of freedom
## Multiple R-squared:  0.859,  Adjusted R-squared:  0.845 
## F-statistic: 60.95 on 1 and 10 DF,  p-value: 0.00001458
predict(m1, newdata = data.total1)
##         1         2         3         4         5         6         7         8 
## 1176308.3 1149603.7 1122899.2 1096194.7 1069490.1 1042785.6 1016081.1  989376.5 
##         9        10        11        12 
##  962672.0  935967.5  909262.9  882558.4
data.total.1.predict <- data.frame(year = 2019:2024, Total = NA_integer_, round(predict(m1, newdata = tibble(Year = 2019:2024), interval = "confidence"), digits = 0))

ggplot(data = data.total1, aes(x = Year, y = Total)) + geom_line() + geom_smooth(formula = "y~x", method = "lm") + geom_line(data = data.total.1.predict, mapping = aes(x = year, y = fit), linetype = "dashed", color = "red") + geom_ribbon(data = data.total.1.predict, mapping = aes(ymin = lwr, ymax = upr, x = year), fill = "red", alpha = 0.2) + scale_x_continuous(breaks = c(2008:2024)) + scale_y_continuous(breaks = seq(700000, 1300000, by = 100000), labels = seq(7, 13, 1)) + xlab(label = "Year") + ylab(label = "Total in hundred thousands") + ggtitle("Total number of animals vs. Year") + theme(
        plot.title = element_text(size = 10, face = "bold", hjust = 0.5),
        axis.title = element_text(size = 10),
        axis.text = element_text(size = 10),
        legend.position = "bottom",
        legend.direction = "horizontal",
        legend.title = element_text(size = 10),
        legend.text = element_text(size = 10)
      )

data.total.1.predict %>% select(year, fit)
##   year    fit
## 1 2019 882558
## 2 2020 855854
## 3 2021 829149
## 4 2022 802445
## 5 2023 775740
## 6 2024 749036

The above plot shows the the prediction line with the confidence interval in red. The table shows the predicted numbers for years 2020-2024.

Conclusion

To summarize, I found that the number of animals used in biomedical research has seen a decreasing trend over the years 2008-2019, with a reduction of 24%. The linear model predicts that if the same trend continues, the numbers would reach roughly 75,000 by 2024. A point to note is that this is the data only for species covered under the Animal Welfare Act and hence it may not be representative of all the animals used. It is very likely that in reality the numbers have increased. Improvements in genetic engineering lead to easy creation of transgenic mice and rats in the 1980s which are estimated to be over 95% of the animals used in research today. This could have resulted in an ongoing decline in the usage of other mammals, as seen in the patterns for dogs and non-human primates for example.

California is the top state in terms of total numbers as well as across most pain categories. Among the top 10 states, 6 of them show a decreasing trend, including California.

The improvements in non-animal methods are also likely to have caused the decline in animal usage. For example, in the cosmetic industry, the “cruelty-free” tag gained traction in the public eye and led to an ongoing transformation in the beauty industry to shift away from performing toxicity testing on rabbits to using in vitro human skin models.

Through this project, I used many of the concepts I learned in BMIN 503. I was able to appreciate the work that goes into cleaning and transforming data to a usable form. I was able to make custom plots and GIS maps to reflect my data and also build a simple linear model.

References

Akhtar, Aysha. 2015. “The Flaws and Human Harms of Animal Experimentation.” Cambridge Quarterly of Healthcare Ethics 24 (4): 407–19. https://doi.org/10.1017/S0963180115000079.
Harding, A. 2004. “More Compounds Failing Phase 1.” http://www.the-scientist.com/?articles.view/articleNo/23003/title/More-compounds-failing-Phase-I/.
Pippin, John J. 2012. “Animal Research in Medical Sciences: Seeking a Convergence of Science, Medicine, and Animal Law.” S. Tex. L. Rev. 54: 469.